The status quo approach to training object detectors requires expensive bounding box annotations. Our framework takes a markedly different direction: we transfer tracked object boxes from weakly-labeled videos to weakly-labeled images to automatically generate pseudo ground-truth boxes, which replace manually annotated bounding boxes. We first mine discriminative regions in the weakly-labeled image collection, that is, regions that appear frequently in the positive images and rarely in the negative images. We then match those regions to videos and retrieve the corresponding tracked object boxes. Finally, we design a Hough transform algorithm to vote for the best box to serve as the pseudo ground truth for each image, and use the selected boxes to train an object detector. Together, these steps lead to state-of-the-art weakly-supervised detection results on the PASCAL 2007 and 2010 datasets.
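The final voting step can be illustrated with a minimal sketch of Hough-style accumulator voting. This is not the authors' implementation: the abstract does not specify the voting space or the weighting, so the choice of a (center x, center y, width, height) accumulator, the `bin_size` parameter, the per-candidate match weights, and the function name `vote_for_pseudo_gt` are all assumptions made here for illustration. Each box transferred from a video casts a weighted vote into a coarse bin, and the winning bin's boxes are averaged to give one pseudo ground-truth box.

```python
from collections import defaultdict


def vote_for_pseudo_gt(candidates, bin_size=32):
    """Pick one pseudo ground-truth box by Hough-style voting (illustrative sketch).

    candidates: list of ((x1, y1, x2, y2), weight) pairs in pixel coordinates,
                where weight is a hypothetical positive region-match score.
    bin_size:   accumulator cell size in pixels (an assumed parameter).
    Returns the weighted mean box of the highest-scoring cell, or None.
    """
    votes = defaultdict(float)   # accumulator: cell -> total vote weight
    members = defaultdict(list)  # cell -> boxes that voted for it

    for box, weight in candidates:
        x1, y1, x2, y2 = box
        # Parameterize each box by its center and size.
        params = ((x1 + x2) / 2, (y1 + y2) / 2, x2 - x1, y2 - y1)
        # Quantize the parameters to find the accumulator cell.
        cell = tuple(int(p // bin_size) for p in params)
        votes[cell] += weight
        members[cell].append((box, weight))

    if not votes:
        return None

    # The peak of the accumulator is the cell with the largest total weight.
    best = max(votes, key=votes.get)
    total = votes[best]
    if total <= 0:
        return None

    # Refine the estimate: weighted average of the boxes in the winning cell.
    return tuple(
        sum(box[i] * weight for box, weight in members[best]) / total
        for i in range(4)
    )
```

With two nearby candidates of weight 1.0 and one distant candidate of weight 1.5, the two agreeing boxes outvote the single stronger one, which is the behavior a voting scheme is meant to provide: consensus among many matched regions counts for more than any one match.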